Students taught by multimodal teachers are superior action recognizers
The focal point of egocentric video understanding is modelling hand-object
interactions. Standard models, such as CNNs and Vision Transformers, which
receive RGB frames as input perform well; however, their performance improves
further when additional modalities, such as object detections, optical flow, or
audio, are used as input. The added complexity of the required
modality-specific modules, on the other hand, makes these models impractical
for deployment. The goal of this work is to retain the performance of such
multimodal approaches, while using only the RGB images as input at inference
time. Our approach is based on multimodal knowledge distillation, featuring a
multimodal teacher (in the current experiments trained using only object
detections, optical flow, and RGB frames) and a unimodal student (using only RGB
frames as input). We present preliminary results demonstrating that the
resulting model, distilled from a multimodal teacher, significantly outperforms
the baseline RGB model (trained without knowledge distillation), as well as an
omnivorous version of itself (trained on all modalities jointly), in both
standard and compositional action recognition.
Comment: Extended abstract accepted at the 2nd Ego4D Workshop @ ECCV 2022
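A minimal sketch of the kind of distillation objective this setup suggests (not the authors' implementation); the temperature T, the mixing weight alpha, and the teacher/student call signatures are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on ground-truth labels plus KL divergence to the
    (frozen) multimodal teacher's temperature-softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl

# Hypothetical training step: the teacher consumes RGB, flow, and object
# detections, while the student sees only the RGB frames.
# with torch.no_grad():
#     teacher_logits = teacher(rgb, flow, detections)
# loss = distillation_loss(student(rgb), teacher_logits, labels)
```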
Linking Surface Facts to Large-Scale Knowledge Graphs
Open Information Extraction (OIE) methods extract facts from natural language
text in the form of ("subject"; "relation"; "object") triples. These facts are,
however, merely surface forms, the ambiguity of which impedes their downstream
usage; e.g., the surface phrase "Michael Jordan" may refer to either the former
basketball player or the university professor. Knowledge Graphs (KGs), on the
other hand, contain facts in a canonical (i.e., unambiguous) form, but their
coverage is limited by a static schema (i.e., a fixed set of entities and
predicates). To bridge this gap, we need the best of both worlds: (i) high
coverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of
KGs. In order to achieve this goal, we propose a new benchmark with novel
evaluation protocols that can, for example, measure fact linking performance on
a granular triple slot level, while also measuring if a system has the ability
to recognize that a surface form has no match in the existing KG. Our extensive
evaluation of several baselines shows that detection of out-of-KG entities and
predicates is more difficult than accurate linking to existing ones, thus
calling for more research efforts on this difficult task. We publicly release
all resources (data, benchmark, and code) at
https://github.com/nec-research/fact-linking
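A minimal sketch of slot-level evaluation in the spirit of the benchmark described above; the triple layout, the "NIL" marker for out-of-KG items, and the scoring rule are assumptions, not the released protocol:

```python
from typing import List, Tuple

# A linked fact: (subject, relation, object), each a KG identifier or "NIL".
Triple = Tuple[str, str, str]

def slot_level_scores(predicted: List[Triple], gold: List[Triple]):
    """Per-slot linking accuracy plus recall of out-of-KG ("NIL") slots."""
    slot_correct = [0, 0, 0]
    nil_total = nil_found = 0
    for pred, ref in zip(predicted, gold):
        for i, (p, g) in enumerate(zip(pred, ref)):
            slot_correct[i] += int(p == g)
            if g == "NIL":
                nil_total += 1
                nil_found += int(p == "NIL")
    n = len(gold)
    per_slot = {name: slot_correct[i] / n
                for i, name in enumerate(("subject", "relation", "object"))}
    nil_recall = nil_found / nil_total if nil_total else None
    return per_slot, nil_recall

# Example: the subject and relation link to (made-up) KG ids, while the
# object slot has no match in the KG and should be recognised as NIL.
print(slot_level_scores(
    predicted=[("Q41421", "P106", "NIL")],
    gold=[("Q41421", "P106", "NIL")],
))
```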
Multimodal Distillation for Egocentric Action Recognition
The focal point of egocentric video understanding is modelling hand-object
interactions. Standard models, e.g. CNNs or Vision Transformers, which receive
RGB frames as input perform well. However, their performance improves further
by employing additional input modalities that provide complementary cues, such
as object detections, optical flow, audio, etc. The added complexity of the
modality-specific modules, on the other hand, makes these models impractical
for deployment. The goal of this work is to retain the performance of such a
multimodal approach, while using only the RGB frames as input at inference
time. We demonstrate that for egocentric action recognition on the
Epic-Kitchens and the Something-Something datasets, students which are taught
by multimodal teachers tend to be more accurate and better calibrated than
architecturally equivalent models trained on ground truth labels in a unimodal
or multimodal fashion. We further adopt a principled multimodal knowledge
distillation framework, allowing us to deal with issues which occur when
applying multimodal knowledge distillation in a naive manner. Lastly, we
demonstrate the achieved reduction in computational complexity, and show that
our approach maintains higher performance as the number of input views is
reduced. We release our code at
https://github.com/gorjanradevski/multimodal-distillation.
Comment: Accepted at ICCV 2023; Codebase released at https://github.com/gorjanradevski/multimodal-distillation
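Since the abstract highlights that distilled students are better calibrated, a common way to quantify calibration is the expected calibration error; the 15-bin scheme below is a standard choice and not necessarily the paper's exact evaluation:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling into each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```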
Cohort-derived machine learning models for individual prediction of chronic kidney disease in people living with HIV: a prospective multicentre cohort study.
BACKGROUND
It is unclear whether data-driven machine learning models, which are trained on large epidemiological cohorts, may improve prediction of co-morbidities in people living with HIV.
METHODS
In this proof-of-concept study, we included people living with HIV from the prospective Swiss HIV Cohort Study with a first estimated glomerular filtration rate (eGFR) >60 ml/min/1.73 m2 after January 1, 2002. Our primary outcome was chronic kidney disease (CKD), defined as a confirmed decrease in eGFR to ≤60 ml/min/1.73 m2 in measurements at least three months apart. We split the cohort data into a training set (80%), a validation set (10%), and a test set (10%), stratified for CKD status and follow-up length.
RESULTS
Of 12,761 eligible individuals (median baseline eGFR, 103 ml/min/1.73 m2), 1,192 (9%) developed CKD after a median of eight years. We used 64 static and 502 time-changing variables. Across prediction horizons and algorithms, and in contrast to expert-based standard models, most machine learning models achieved state-of-the-art predictive performance, with areas under the receiver operating characteristic curve and the precision-recall curve ranging from 0.926 to 0.996 and from 0.631 to 0.956, respectively.
CONCLUSIONS
In people living with HIV, we observed state-of-the-art performance in forecasting individual CKD onset with different machine learning algorithms.
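For context, the stratified split and the two reported metrics (area under the ROC curve and under the precision-recall curve) can be computed with scikit-learn; the variable names and the 80/10/10 split below are illustrative, not the study's pipeline, and stratification by follow-up length is omitted for brevity:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

def split_and_evaluate(X, y, model):
    """80/10/10 split stratified on the outcome, then AUROC / AUPRC on the test set."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    return {
        "auroc": roc_auc_score(y_test, scores),
        "auprc": average_precision_score(y_test, scores),
    }
```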
A multi-modal AI approach for intuitively instructable autonomous systems
We present a multi-modal AI framework to intuitively instruct and control Automated Guided Vehicles. We define a general multi-modal AI architecture with loose coupling between three AI modules: spoken language understanding, visual perception, and reinforcement learning navigation. We use the same multi-modal architecture for two different use cases implemented on two different platforms: an off-road vehicle that can pick objects, and an indoor forklift that performs automated warehouse inventory. We show that the proposed architecture can be used for a wide range of tasks and can be implemented on different hardware, demonstrating a high degree of modularity.
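The loose coupling described above might look like independent module interfaces composed by a thin coordinator; the interfaces, method names, and message type below are hypothetical illustrations, not the paper's actual APIs:

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple

@dataclass
class Command:
    action: str              # e.g. "pick" or "navigate"
    target: Optional[str]    # e.g. an object label or a location name

class SpokenLanguageUnderstanding(Protocol):
    def parse(self, utterance: str) -> Command: ...

class VisualPerception(Protocol):
    def locate(self, target: str) -> Tuple[float, float]: ...

class RLNavigation(Protocol):
    def drive_to(self, position: Tuple[float, float]) -> None: ...

def execute(utterance: str, slu: SpokenLanguageUnderstanding,
            vision: VisualPerception, nav: RLNavigation) -> None:
    """Coordinator: each module can be swapped independently (loose coupling)."""
    command = slu.parse(utterance)
    if command.target is not None:
        nav.drive_to(vision.locate(command.target))
```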